This paper presents a novel method for face clustering in videos using a video-centric Transformer. Previous works often adopt contrastive learning to learn frame-level representations and use average pooling to aggregate the features along the temporal dimension. This approach may not fully capture complicated video dynamics. In addition, despite the recent progress in video-based contrastive learning, few have attempted to learn a self-supervised, clustering-friendly face representation that benefits the video face clustering task. To overcome these limitations, our method employs a Transformer to directly learn video-level representations that can better reflect the temporally varying properties of faces in videos, and we also propose a video-centric self-supervised framework to train the Transformer model. We further investigate face clustering in egocentric videos, a fast-emerging field that has not been studied in previous works on face clustering. To this end, we present and release the first large-scale egocentric video face clustering dataset named EasyCom-Clustering. We evaluate the proposed method on both the widely used Big Bang Theory (BBT) dataset and the new EasyCom-Clustering dataset. Results show that the performance of our video-centric Transformer surpasses all previous state-of-the-art methods on both benchmarks, exhibiting a self-attentive understanding of face videos.
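As a rough illustration of the contrast drawn above between temporal average pooling and a video-level Transformer, the sketch below (in PyTorch) aggregates per-frame face embeddings with self-attention over time; the embedding dimension, layer counts, and the use of a [CLS]-style aggregation token are assumptions, not the authors' exact architecture.

```python
# Minimal sketch: a Transformer that turns a track of per-frame face embeddings
# into a single video-level embedding, versus the mean-pooling baseline.
import torch
import torch.nn as nn

class VideoFaceEncoder(nn.Module):
    """Aggregate per-frame face embeddings into one video-level embedding."""
    def __init__(self, dim=256, num_layers=4, num_heads=8):
        super().__init__()
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))   # [CLS]-style token (assumption)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, frame_embeddings):                  # (B, T, dim)
        b = frame_embeddings.size(0)
        tokens = torch.cat([self.cls.expand(b, -1, -1), frame_embeddings], dim=1)
        out = self.encoder(tokens)                        # self-attention over time
        return out[:, 0]                                  # video-level representation

# The baseline the abstract argues against: temporal average pooling.
def mean_pool(frame_embeddings):                          # (B, T, dim) -> (B, dim)
    return frame_embeddings.mean(dim=1)
```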
Metric learning aims to learn a distance metric such that semantically similar instances are pulled together while dissimilar instances are pushed apart. Many existing methods consider maximizing, or at least constraining, a distance margin in the feature space that separates similar and dissimilar pairs of instances, so as to guarantee their generalization ability. In this paper, we advocate imposing an adversarial margin in the input space to improve the generalization and robustness of metric learning algorithms. We first show that the adversarial margin, defined as the distance between a training instance and its closest adversarial example in the input space, takes into account both the distance gap in the feature space and the correlation between the metric and the triplet constraints. Next, to enhance robustness against instance perturbation, we propose to enlarge the adversarial margin by minimizing a novel loss function called the perturbation loss. The proposed loss can be viewed as a data-dependent regularizer and can easily be plugged into any existing metric learning method. Finally, we show that the enlarged margin benefits the generalization ability by using the theoretical techniques of algorithmic robustness. Experimental results on 16 datasets demonstrate the superiority of the proposed method over existing state-of-the-art methods in both discrimination accuracy and robustness against possible noise.
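A minimal sketch of the idea, assuming a triplet-style formulation and a one-step (FGSM-like) input perturbation; the paper's exact definition of the perturbation loss may differ. The regularizer penalizes how much a small input-space perturbation can shrink the feature-space margin.

```python
# Hedged sketch (not the paper's exact formulation): a feature-space triplet
# margin plus a "perturbation loss" measuring how much that margin drops under
# a one-step adversarial perturbation of the anchor in the input space.
import torch
import torch.nn.functional as F

def triplet_margin(f, anchor, positive, negative):
    """Feature-space gap between negative and positive distances."""
    d_ap = (f(anchor) - f(positive)).pow(2).sum(dim=1)
    d_an = (f(anchor) - f(negative)).pow(2).sum(dim=1)
    return d_an - d_ap

def perturbation_loss(f, anchor, positive, negative, eps=0.05):
    """Penalize margin reduction under a one-step adversarial perturbation."""
    anchor_adv = anchor.clone().detach().requires_grad_(True)
    margin = triplet_margin(f, anchor_adv, positive, negative).sum()
    grad, = torch.autograd.grad(margin, anchor_adv)
    # Step in the direction that shrinks the margin the most (FGSM-style).
    anchor_adv = anchor + eps * (-grad).sign()
    adv_margin = triplet_margin(f, anchor_adv, positive, negative)
    clean_margin = triplet_margin(f, anchor, positive, negative)
    return F.relu(clean_margin - adv_margin).mean()   # data-dependent regularizer
```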
Dilated convolutions are widely used in deep semantic segmentation models because they can enlarge the receptive field of the filter without introducing additional weights or sacrificing spatial resolution. However, as dilated convolutional filters have no positional knowledge about the pixels on semantically meaningful contours, they may lead to blurry predictions at object boundaries. In addition, although a dilated filter can expand its receptive field, the total number of sampled pixels remains unchanged, usually covering only a small fraction of the receptive field's total area. Inspired by the lateral inhibition (LI) mechanism in the human visual system, we propose dilated convolutions with lateral inhibition (LI-Convs) to overcome these limitations. Introducing the LI mechanism improves the sensitivity of the convolutional filters to semantic object boundaries. Moreover, since LI-Convs also implicitly take into account the pixels in the laterally inhibited zones, they can extract features at denser scales. By integrating LI-Convs into the DeepLabv3+ architecture, we propose the lateral-inhibited atrous spatial pyramid pooling (LI-ASPP), the lateral-inhibited MobileNet-V2 (LI-MNV2), and the lateral-inhibited ResNet (LI-ResNet). Experimental results on three benchmark datasets (PASCAL VOC 2012, CelebAMask-HQ, and ADE20K) show that our LI-based segmentation models consistently outperform their baselines on all of them, verifying the effectiveness and generality of the proposed LI-Convs.
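The exact LI formulation is not given here, so the sketch below only illustrates one plausible reading of the mechanism: the input to a standard dilated convolution is first suppressed by a learned, locally pooled inhibition signal. The layer structure, inhibition form, and parameter names are all assumptions.

```python
# Hedged sketch of a lateral-inhibition dilated convolution: subtract a learned
# depthwise-pooled "inhibition" signal before the dilated convolution, so
# activity in the inhibited neighborhood suppresses the center response.
import torch
import torch.nn as nn

class LIConv2d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3, dilation=2, li_kernel=3):
        super().__init__()
        pad = dilation * (kernel_size // 2)
        # Depthwise conv approximating the laterally inhibited neighborhood.
        self.inhibit = nn.Conv2d(in_ch, in_ch, li_kernel, padding=li_kernel // 2,
                                 groups=in_ch, bias=False)
        self.alpha = nn.Parameter(torch.tensor(0.1))     # inhibition strength
        self.dilated = nn.Conv2d(in_ch, out_ch, kernel_size,
                                 padding=pad, dilation=dilation)

    def forward(self, x):
        x = torch.relu(x - self.alpha * self.inhibit(x)) # lateral inhibition
        return self.dilated(x)                           # dilated convolution
```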
Graph Neural Networks (GNNs) have shown satisfying performance on various graph learning tasks. To achieve better fitting capability, most GNNs have a large number of parameters, which makes these GNNs computationally expensive. Therefore, it is difficult to deploy them onto edge devices with scarce computational resources, e.g., mobile phones and wearable smart devices. Knowledge Distillation (KD) is a common solution to compress GNNs, where a lightweight model (i.e., the student model) is encouraged to mimic the behavior of a computationally expensive GNN (i.e., the teacher GNN model). Nevertheless, most existing GNN-based KD methods lack fairness considerations. As a consequence, the student model usually inherits and even exaggerates the bias from the teacher GNN. To handle such a problem, we take initial steps towards fair knowledge distillation for GNNs. Specifically, we first formulate a novel problem of fair knowledge distillation for GNN-based teacher-student frameworks. Then we propose a principled framework named RELIANT to mitigate the bias exhibited by the student model. Notably, the design of RELIANT is decoupled from any specific teacher and student model structures, and thus can be easily adapted to various GNN-based KD frameworks. We perform extensive experiments on multiple real-world datasets, which corroborate that RELIANT achieves less biased GNN knowledge distillation while maintaining high prediction utility.
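A hedged sketch of where a fairness term could enter a GNN distillation objective: a standard soft-target KD loss plus a demographic-parity proxy on the student's predictions. The proxy is purely illustrative and is not RELIANT's actual debiasing mechanism.

```python
# Hedged sketch: soft-target knowledge distillation (teacher -> student) with an
# illustrative fairness penalty that matches mean student predictions across
# two sensitive groups. Weights alpha/beta are assumptions.
import torch
import torch.nn.functional as F

def kd_fair_loss(student_logits, teacher_logits, labels, sensitive,
                 temperature=2.0, alpha=0.5, beta=0.1):
    # Standard distillation: KL between temperature-softened distributions.
    kd = F.kl_div(F.log_softmax(student_logits / temperature, dim=1),
                  F.softmax(teacher_logits / temperature, dim=1),
                  reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    # Illustrative fairness proxy: align mean predicted distributions per group.
    probs = F.softmax(student_logits, dim=1)
    gap = (probs[sensitive == 0].mean(0) - probs[sensitive == 1].mean(0)).abs().sum()
    return alpha * kd + (1 - alpha) * ce + beta * gap
```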
Despite significant progress in object categorization in recent years, a number of important challenges remain; mainly, the ability to learn from limited labeled data and to recognize object classes within a large, potentially open, set of labels. Zero-shot learning is one way of addressing these challenges, but it has only been shown to work with limited-sized class vocabularies and typically requires separation between supervised and unsupervised classes, allowing the former to inform the latter but not vice versa. We propose the notion of vocabulary-informed learning to alleviate the above-mentioned challenges and address problems of supervised, zero-shot, generalized zero-shot and open set recognition using a unified framework. Specifically, we propose a weighted maximum margin framework for semantic manifold-based recognition that incorporates distance constraints from (both supervised and unsupervised) vocabulary atoms. Distance constraints ensure that labeled samples are projected closer to their correct prototypes, in the embedding space, than to others. We illustrate that the resulting model shows improvements in supervised, zero-shot, generalized zero-shot, and large open set recognition, with up to a 310K class vocabulary on the Animals with Attributes and ImageNet datasets.
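A minimal sketch of the distance-constraint idea, assuming Euclidean distances between projected samples and class prototypes (vocabulary atoms); the weighting scheme of the actual maximum-margin framework is omitted.

```python
# Hedged sketch: a labeled sample's embedding should be closer to its own
# vocabulary atom than to the nearest competing atom by a margin.
import torch
import torch.nn.functional as F

def vocab_margin_loss(embeddings, labels, prototypes, margin=0.1):
    """embeddings: (B, d); labels: (B,); prototypes: (V, d) for all vocabulary atoms."""
    dists = torch.cdist(embeddings, prototypes)           # (B, V)
    correct = dists.gather(1, labels.unsqueeze(1))        # distance to own atom
    # Mask out the correct atom, then take the closest competing atom.
    masked = dists.scatter(1, labels.unsqueeze(1), float("inf"))
    nearest_other = masked.min(dim=1, keepdim=True).values
    return F.relu(margin + correct - nearest_other).mean()
```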
Advances in computer vision and machine learning techniques have led to significant development in 2D and 3D human pose estimation from RGB cameras, LiDAR, and radars. However, human pose estimation from images is adversely affected by occlusion and lighting, which are common in many scenarios of interest. Radar and LiDAR technologies, on the other hand, need specialized hardware that is expensive and power-intensive. Furthermore, placing these sensors in non-public areas raises significant privacy concerns. To address these limitations, recent research has explored the use of WiFi antennas (1D sensors) for body segmentation and key-point body detection. This paper further expands on the use of the WiFi signal in combination with deep learning architectures, commonly used in computer vision, to estimate dense human pose correspondence. We developed a deep neural network that maps the phase and amplitude of WiFi signals to UV coordinates within 24 human regions. The results of the study reveal that our model can estimate the dense pose of multiple subjects, with comparable performance to image-based approaches, by utilizing WiFi signals as the only input. This paves the way for low-cost, broadly accessible, and privacy-preserving algorithms for human sensing.
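A rough sketch, not the paper's architecture: a small encoder-decoder that maps stacked WiFi amplitude and phase measurements to per-pixel UV coordinates for 24 body regions (24 × 2 output channels). Input shape and channel counts are assumptions.

```python
# Hedged sketch: map WiFi channel measurements (amplitude + phase stacked as
# channels) to dense UV coordinate maps for 24 human body regions.
import torch
import torch.nn as nn

class WiFiDensePoseNet(nn.Module):
    def __init__(self, in_ch=6, regions=24):              # e.g. 3 antennas x {amp, phase}
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, regions * 2, 4, stride=2, padding=1),
            nn.Sigmoid(),                                  # UV coordinates in [0, 1]
        )

    def forward(self, csi):                                # (B, in_ch, H, W)
        return self.decoder(self.encoder(csi))             # (B, regions * 2, H, W)
```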
With the increasing ability of large language models (LLMs), in-context learning (ICL) has become a new paradigm for natural language processing (NLP), where LLMs make predictions only based on contexts augmented with a few training examples. Exploring ICL to evaluate and extrapolate the ability of LLMs has become a new trend. In this paper, we aim to survey and summarize the progress, challenges, and future work in ICL. We first present a formal definition of ICL and clarify its correlation to related studies. Then, we organize and discuss advanced techniques of ICL, including training strategies, prompting strategies, and so on. Finally, we present the challenges of ICL and provide potential directions for further research. We hope our work can encourage more research on uncovering how ICL works and improving ICL in future work.
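A minimal illustration of what in-context learning looks like in practice: the prediction is conditioned on a prompt that concatenates a few demonstrations with the query, with no parameter updates. The template below is one common format, not a prescribed standard.

```python
# Minimal ICL illustration: build a few-shot prompt from demonstrations and a
# query; a frozen LLM then completes the final "Sentiment:" slot.
def build_icl_prompt(demonstrations, query):
    lines = []
    for text, label in demonstrations:
        lines.append(f"Review: {text}\nSentiment: {label}\n")
    lines.append(f"Review: {query}\nSentiment:")
    return "\n".join(lines)

demos = [("Great plot and acting.", "positive"),
         ("A complete waste of time.", "negative")]
prompt = build_icl_prompt(demos, "The soundtrack alone is worth it.")
# `prompt` is fed to the LLM as-is; no model parameters are updated.
```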
The node-place model has been widely used to classify and evaluate transit stations, which sheds light on individual travel behaviors and supports urban planning through effectively integrating land use and transportation development. This article adapts this model to investigate whether and how node, place, and mobility would be associated with the transmission risks and presence of local COVID-19 cases in a city. To our knowledge, similar studies on the model and its relevance to COVID-19 have not been undertaken before. Moreover, a unique metric drawn from the detailed visit history of the infected, i.e., COVID-19 footprints, is proposed and exploited. This study then empirically uses the adapted model to examine the station-level factors affecting the local COVID-19 footprints. The model accounts for traditional measures of the node and place as well as actual human mobility patterns associated with the node and place. It finds that stations with high node, place, and human mobility indices normally have more COVID-19 footprints in proximity. A multivariate regression is fitted to see whether and to what degree different indices and indicators can predict the COVID-19 footprints. The results indicate that many of the place, node, and human mobility indicators significantly impact the concentration of COVID-19 footprints. These findings are useful for policy-makers in predicting and monitoring hotspots for the transmission of COVID-19 and other pandemics.
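A hedged sketch of the regression step described above: an ordinary least-squares fit of station-level COVID-19 footprints on node, place, and mobility indicators. The file name and column names are hypothetical; the paper's actual indicator set may differ.

```python
# Hedged sketch: multivariate OLS regression of COVID-19 footprints on
# station-level node, place, and mobility indices.
import pandas as pd
import statsmodels.api as sm

stations = pd.read_csv("station_indicators.csv")     # hypothetical input file
X = stations[["node_index", "place_index", "mobility_index"]]
X = sm.add_constant(X)                                # intercept term
y = stations["covid_footprints"]

model = sm.OLS(y, X).fit()
print(model.summary())                                # coefficients and p-values
```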
Designing better deep networks and better reinforcement learning (RL) algorithms are both important for deep RL. This work focuses on the former. Previous methods build the network with several modules like CNN, LSTM and Attention. Recent methods combine the Transformer with these modules for better performance. However, it requires tedious optimization skills to train a network composed of mixed modules, making these methods inconvenient to use in practice. In this paper, we propose to design \emph{pure Transformer-based networks} for deep RL, aiming at providing off-the-shelf backbones for both the online and offline settings. Specifically, the Transformer in Transformer (TIT) backbone is proposed, which cascades two Transformers in a very natural way: the inner one is used to process a single observation, while the outer one is responsible for processing the observation history; combining both is expected to extract spatial-temporal representations for good decision-making. Experiments show that TIT can consistently achieve satisfactory performance in different settings.
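A minimal sketch of the cascaded design, assuming each observation has already been tokenized (e.g., into patch embeddings); dimensions, layer counts, and the choice of pooling are illustrative rather than TIT's exact configuration.

```python
# Hedged sketch of the Transformer-in-Transformer idea: an inner Transformer
# encodes the tokens of a single observation; an outer Transformer encodes the
# sequence of per-observation embeddings over the history.
import torch
import torch.nn as nn

class TIT(nn.Module):
    def __init__(self, dim=128, heads=4, inner_layers=2, outer_layers=2):
        super().__init__()
        inner = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        outer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.inner = nn.TransformerEncoder(inner, inner_layers)
        self.outer = nn.TransformerEncoder(outer, outer_layers)

    def forward(self, obs_tokens):                  # (B, T, N, dim): T steps, N tokens each
        b, t, n, d = obs_tokens.shape
        per_obs = self.inner(obs_tokens.reshape(b * t, n, d)).mean(dim=1)
        history = self.outer(per_obs.reshape(b, t, d))
        return history[:, -1]                       # representation used for decision-making
```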
Recently, deep learning has shown its advantages in representation learning and clustering for time series data. Despite the considerable progress, the existing deep time series clustering approaches mostly seek to train the deep neural network with some instance-reconstruction-based or cluster-distribution-based objective, which, however, lacks the ability to exploit the sample-wise (or augmentation-wise) contrastive information or even the higher-level (e.g., cluster-level) contrastiveness for learning discriminative and clustering-friendly representations. In light of this, this paper presents a deep temporal contrastive clustering (DTCC) approach, which for the first time, to our knowledge, incorporates the contrastive learning paradigm into deep time series clustering research. Specifically, with two parallel views generated from the original time series and their augmentations, we utilize two identical auto-encoders to learn the corresponding representations, and in the meantime perform the cluster distribution learning by incorporating a k-means objective. Further, two levels of contrastive learning are simultaneously enforced to capture the instance-level and cluster-level contrastive information, respectively. With the reconstruction loss of the auto-encoder, the cluster distribution loss, and the two levels of contrastive losses jointly optimized, the network architecture is trained in a self-supervised manner and the clustering result can thereby be obtained. Experiments on a variety of time series datasets demonstrate the superiority of our DTCC approach over the state-of-the-art.
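A hedged sketch of how the four objectives could be combined. The NT-Xent-style contrastive term below is a common choice rather than necessarily DTCC's exact formulation, and the reconstruction and cluster-distribution losses are passed in as precomputed scalars.

```python
# Hedged sketch: joint objective = reconstruction + cluster distribution
# + instance-level contrast (between the two views) + cluster-level contrast
# (between the two views' soft cluster-assignment columns).
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.5):
    """Contrast between two views of the same batch (SimCLR-style)."""
    z = F.normalize(torch.cat([z1, z2]), dim=1)            # (2B, d)
    sim = z @ z.t() / temperature
    sim.fill_diagonal_(float("-inf"))                      # exclude self-pairs
    b = z1.size(0)
    targets = torch.cat([torch.arange(b, 2 * b), torch.arange(0, b)])
    return F.cross_entropy(sim, targets)

def dtcc_loss(recon, recon_aug, cluster_dist, inst, inst_aug, clus, clus_aug,
              weights=(1.0, 1.0, 1.0, 1.0)):
    w_r, w_k, w_i, w_c = weights
    instance_cl = nt_xent(inst, inst_aug)                  # instance-level
    cluster_cl = nt_xent(clus.t(), clus_aug.t())           # cluster-level (over columns)
    return w_r * (recon + recon_aug) + w_k * cluster_dist + \
           w_i * instance_cl + w_c * cluster_cl
```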